Key Takeaways


Evolution of data architecture 


Data Lakehouse 


Architectural deep dive: Iceberg 


How do queries work under the covers?


Design Benefits 
Evolution of Data Architecture


How did we get here? 


Centralized, reliable data platform 
Democratize data 


Data Warehouse -> Data Lakes -> Lakehouse 
Iceberg in a Data Lakehouse

[Diagram: layers of a data lakehouse, top to bottom]
Users & Applications: e.g. Jupyter
Interfaces: Arrow Flight, ODBC/JDBC
Table Format: Apache Iceberg
File Formats: Parquet, ORC, CSV, JSON
Data Lake Storage

dremio 


What is a table format? 


• A way to organize a dataset's files to present them as a single "table"

• A way to answer the question "what data is in this table?"

db1.table1

Old way (Hive): a table's contents is all files in that table's directories
Hive table format

db1.table1

A table's contents is all files in that table's directories

The old de-facto standard

Pros

• Works with basically every engine, since it has been the de-facto standard for so long
• More efficient access patterns than full-table scans for every query
• File format agnostic
• Atomically update a whole partition
• Single, central answer to "what data is in this table" for the whole ecosystem

Cons

• Smaller updates are very inefficient
• No way to change data in multiple partitions safely
• In practice, multiple jobs modifying the same dataset don't do so safely
• All of the directory listings needed for large tables take a long time
• Users have to know the physical layout of the table
• Hive table statistics are often stale
How can we resolve these issues?

• We need a new table format

db1.table1 (Hive): a table's contents is all files in that table's directories

db1.table1 (new table format): a table is a canonical list of files
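
The difference between the two definitions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `k=1` partition directory, and the file names are all made up here, and real engines do not work on a local file system like this. It shows why a directory listing picks up a stray file from a half-finished write, while a canonical list does not.

```python
import os
import tempfile


def hive_style_contents(table_dir):
    """Hive-style: the table is whatever files are found under its directories."""
    found = []
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), table_dir))
    return sorted(found)


def listed_contents(canonical_list):
    """List-style: the table is exactly the files named in a canonical list."""
    return sorted(canonical_list)


with tempfile.TemporaryDirectory() as table_dir:
    partition_dir = os.path.join(table_dir, "k=1")  # hypothetical partition
    os.makedirs(partition_dir)
    for name in ("a.parquet", "b.parquet"):
        open(os.path.join(partition_dir, name), "w").close()

    # The committed state of the table, recorded explicitly by the writer.
    canonical = [os.path.join("k=1", "a.parquet"), os.path.join("k=1", "b.parquet")]

    # A half-finished write leaves a stray file in the directory.
    open(os.path.join(partition_dir, "c.parquet.tmp"), "w").close()

    hive_view = hive_style_contents(table_dir)
    list_view = listed_contents(canonical)

print(len(hive_view), "files via directory listing")
print(len(list_view), "files via canonical list")
```
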


Netflix's goals

• Table correctness/consistency
• Faster query planning and execution
• Allow users to not worry about the physical layout of the data
• Table evolution
• Accomplish all of these at scale
What Iceberg is and isn't

Iceberg is:

• A table format specification
• A set of APIs and libraries for interacting with that specification
  o These libraries are leveraged in other engines and tools, allowing them to interact with Iceberg tables

Iceberg is not:

• A storage engine
• An execution engine
• A service
Architectural Deep Dive

Iceberg table format

[Diagram: the three layers of an Iceberg table. Catalog layer: the Iceberg Catalog holds the current metadata pointer for db1.table1. Metadata layer: metadata files, manifest lists, and manifest files. Data layer: data files.]

• Overview of the components
• Summary of the read path (SELECT)
Iceberg components: Catalog

Iceberg Catalog

• A store that houses the current metadata pointer for Iceberg tables
• Must support atomic operations for updating the current metadata pointer (e.g. HDFS, HMS, Nessie)

table1's current metadata pointer

• Mapping of table name to the location of the current metadata file
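
The atomic-update requirement can be sketched as a compare-and-swap on the pointer. The class below is a hypothetical in-memory stand-in and not an Iceberg API; the method names `current_metadata` and `swap` and the file name `v2-other.metadata.json` are assumptions made for illustration. The point is that a writer may only move the pointer if nobody else has moved it since that writer read it.

```python
import threading


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog: table name -> current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current_metadata(self, table):
        with self._lock:
            return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Atomically set the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # someone else committed first; caller must retry
            self._pointers[table] = new
            return True


catalog = InMemoryCatalog()
created = catalog.swap("db1.table1", None, "table1/metadata/v1.metadata.json")

# Two writers both start from the same base metadata file.
base = catalog.current_metadata("db1.table1")
first = catalog.swap("db1.table1", base, "table1/metadata/v2.metadata.json")
second = catalog.swap("db1.table1", base, "table1/metadata/v2-other.metadata.json")

print(created, first, second)
```

Only one of the two competing writers succeeds; the other sees a failed swap and would have to re-read the pointer and retry on top of the new state.
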




Iceberg components: Metadata File

Metadata file - stores metadata about a table at a certain point in time

    {
      "table-uuid" : "<uuid>",
      "location" : "/path/to/table/dir",
      "schema": {...},
      "partition-spec": [ {<partition-details>}, ...],
      "current-snapshot-id": <snapshot-id>,
      "snapshots": [ {
        "snapshot-id": <snapshot-id>,
        "manifest-list": "/path/to/manifest/list.avro"
      }, ...],
      ...
    }
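
As a rough sketch of how a reader might use these fields, the snippet below looks up the manifest list of the current snapshot. The JSON document is a made-up, trimmed example (the UUID, snapshot ids, and `list-1.avro` / `list-2.avro` paths are placeholders), and `current_manifest_list` is a hypothetical helper, not part of any Iceberg library.

```python
import json

# Trimmed, made-up metadata file; only the fields used below are present.
metadata_json = """
{
  "table-uuid": "00000000-0000-0000-0000-000000000000",
  "location": "/path/to/table/dir",
  "current-snapshot-id": 2,
  "snapshots": [
    {"snapshot-id": 1, "manifest-list": "/path/to/manifest/list-1.avro"},
    {"snapshot-id": 2, "manifest-list": "/path/to/manifest/list-2.avro"}
  ]
}
"""


def current_manifest_list(metadata):
    """Return the manifest list path of the current snapshot, or None."""
    current_id = metadata.get("current-snapshot-id")
    for snapshot in metadata.get("snapshots", []):
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    return None  # e.g. a freshly created table with no snapshots yet


manifest_list_path = current_manifest_list(json.loads(metadata_json))
print(manifest_list_path)
```
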
Iceberg components: Manifest List

Manifest list file - a list of manifest files

    {
      "manifest-path": "/path/to/manifest/file.avro",
      "added-snapshot-id": <snapshot-id>,
      "partition-spec-id": <partition-spec-id>,
      "partitions": [ {partition-info}, ...],
      ...
    }

    {
      "manifest-path": "/path/to/manifest/file2.avro",
      "added-snapshot-id": <snapshot-id>,
      "partition-spec-id": <partition-spec-id>,
      "partitions": [ {partition-info}, ...],
      ...
    }
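
The partition information kept per manifest lets a reader skip whole manifests without opening them. A minimal sketch, assuming each `partitions` entry carries a lower and upper bound for a single partition field: the keys `lower` and `upper`, the helper name, and the partition values are assumptions for illustration, since the slide only shows `{partition-info}`.

```python
def manifests_to_read(manifest_list, wanted_low, wanted_high):
    """Keep manifests whose partition range overlaps [wanted_low, wanted_high]."""
    selected = []
    for entry in manifest_list:
        summary = entry["partitions"][0]  # single partition field in this sketch
        if summary["upper"] >= wanted_low and summary["lower"] <= wanted_high:
            selected.append(entry["manifest-path"])
    return selected


# Made-up manifest list entries; fixed-width strings compare in time order.
manifest_list = [
    {"manifest-path": "/path/to/manifest/file.avro",
     "partitions": [{"lower": "2021-01-26-08", "upper": "2021-01-26-08"}]},
    {"manifest-path": "/path/to/manifest/file2.avro",
     "partitions": [{"lower": "2021-01-27-10", "upper": "2021-01-27-10"}]},
]

# A query restricted to 2021-01-26 only needs the first manifest.
selected = manifests_to_read(manifest_list, "2021-01-26-00", "2021-01-26-23")
print(selected)
```
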
Iceberg components: Manifest file

Manifest file - a list of data files, along with details and stats about each data file

    {
      "data-file": {
        "file-path": "/path/to/data/file.parquet",
        "file-format": "PARQUET",
        "partition": {"<part-field>":{"<data-type>":<value>}},
        "record-count": <num-records>,
        "null-value-counts": [{
          "column-index": "1", "value": 4
        }, ...],
        "lower-bounds": [{
          "column-index": "1", "value": "aaa"
        }, ...],
        "upper-bounds": [{
          "column-index": "1", "value": "eee"
        }, ...]
      }
      ...
    }
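
The per-file lower and upper bounds are what make min/max filtering possible. The sketch below reuses the field names from the slide; the helper `may_contain` and the second file (`file2.parquet` with bounds `fff` to `zzz`) are made up for illustration. A file is skipped only when its stats prove that it cannot contain the value.

```python
def may_contain(data_file, column_index, value):
    """False only when the stats prove that no row in the file can equal `value`."""
    lower = {b["column-index"]: b["value"] for b in data_file["lower-bounds"]}
    upper = {b["column-index"]: b["value"] for b in data_file["upper-bounds"]}
    if column_index not in lower or column_index not in upper:
        return True  # no stats for this column, so the file cannot be skipped
    return lower[column_index] <= value <= upper[column_index]


data_files = [
    {"file-path": "/path/to/data/file.parquet",
     "lower-bounds": [{"column-index": "1", "value": "aaa"}],
     "upper-bounds": [{"column-index": "1", "value": "eee"}]},
    {"file-path": "/path/to/data/file2.parquet",  # hypothetical second file
     "lower-bounds": [{"column-index": "1", "value": "fff"}],
     "upper-bounds": [{"column-index": "1", "value": "zzz"}]},
]

# Equality predicate on column 1: only files whose range covers "ccc" are scanned.
to_scan = [f["file-path"] for f in data_files if may_contain(f, "1", "ccc")]
print(to_scan)
```
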


Let's Look Under the Covers 
CREATE TABLE

    CREATE TABLE db1.table1 (
        order_id bigint,
        customer_id bigint,
        order_amount DECIMAL(10, 2),
        order_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY ( hour(order_ts) );

db1.table1 ->
table1/metadata/v1.metadata.json

    table1/
    |- metadata/
    |  |- v1.metadata.json
    |- data/

[Diagram: the catalog's current metadata pointer for db1.table1 points to the new metadata file; the data layer is empty.]
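
Partitioning by `hour(order_ts)` means the partition value is derived from the timestamp column rather than supplied by the user. A minimal sketch of that derivation, assuming the timestamp is interpreted as UTC: the Iceberg specification defines the hour transform as whole hours since 1970-01-01 00:00:00, while the `YYYY-MM-DD-HH` label is the human-readable form. The function name `hour_partition` is made up for this example.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_partition(order_ts):
    """Return (hours since epoch, readable label) for a 'YYYY-MM-DD HH:MM:SS' UTC string."""
    ts = datetime.strptime(order_ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    hours = int((ts - EPOCH).total_seconds() // 3600)
    label = ts.strftime("%Y-%m-%d-%H")
    return hours, label


hours, label = hour_partition("2021-01-26 08:10:23")
print(hours, label)
```

Two rows whose timestamps fall in the same hour get the same partition value, so they land in the same partition without the writer ever naming it.
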
INSERT

    INSERT INTO db1.table1 VALUES (
        123,
        456,
        30.17,
        '2021-01-26 08:10:23'
    );

db1.table1 ->
table1/metadata/v2.metadata.json

    table1/
    |- metadata/
    |  |- v1.metadata.json
    |  |- v2.metadata.json
    |  |- snap-29c8-1-b103.avro
    |  |- d8f9-ad19-4e.avro
    |- data/
    |  |- order_ts_hour=2021-01-26-08/
    |  |  |- 00000-5-cae2d.parquet

[Diagram: the current metadata pointer now points to the second metadata file, which references a manifest list, a manifest file, and the new data file.]
UPSERT

    MERGE INTO db1.table1
    USING ( SELECT * FROM table1_stage ) s
    ON table1.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET table1.order_amount = s.order_amount
    WHEN NOT MATCHED THEN INSERT *;

db1.table1 ->
table1/metadata/v3.metadata.json

    table1/
    |- metadata/
    |  |- v1.metadata.json
    |  |- v2.metadata.json
    |  |- v3.metadata.json
    |  |- snap-29c8-1-b103.avro
    |  |- snap-9fa1-3-16c3.avro
    |  |- d8f9-ad19-4e.avro
    |  |- 0d9a-98fa-77.avro
    |- data/
    |  |- order_ts_hour=2021-01-26-08/
    |  |  |- 00000-5-cae2d.parquet
    |  |  |- 00000-1-aef71.parquet
    |  |- order_ts_hour=2021-01-27-10/
    |  |  |- 00000-3-0fa3a.parquet

[Diagram: the current metadata pointer now points to the third metadata file. The earlier snapshot's manifest list, manifest file, and data file remain in place; the new snapshot has its own manifest list and manifest file.]
READ

    SELECT *
    FROM db1.table1
    WHERE order_ts = DATE '2021-01-26'

db1.table1 ->
table1/metadata/v3.metadata.json

    table1/
    |- metadata/
    |  |- v1.metadata.json
    |  |- v2.metadata.json
    |  |- v3.metadata.json
    |  |- snap-29c8-1-b103.avro
    |  |- snap-9fa1-3-16c3.avro
    |  |- d8f9-ad19-4e.avro
    |  |- 0d9a-98fa-77.avro
    |- data/
    |  |- order_ts_hour=2021-01-26-08/
    |  |  |- 00000-5-cae2d.parquet
    |  |  |- 00000-1-aef71.parquet
    |  |- order_ts_hour=2021-01-27-10/
    |  |  |- 00000-3-0fa3a.parquet

[Diagram: the read path runs from the catalog's current metadata pointer to the current metadata file, then to its manifest list, and on to the data file.]
TIME TRAVEL

    SELECT *
    FROM db1.table1 AS OF '2021-05-26 09:30:00'
    -- (timestamp is from before MERGE INTO operation)

    table1/
    |- metadata/
    |  |- v1.metadata.json
    |  |- v2.metadata.json
    |  |- v3.metadata.json
    |  |- snap-29c8-1-b103.avro
    |  |- snap-9fa1-3-16c3.avro
    |  |- d8f9-ad19-4e.avro
    |  |- 0d9a-98fa-77.avro
    |- data/
    |  |- order_ts_hour=2021-01-26-08/
    |  |  |- 00000-5-cae2d.parquet
    |  |  |- 00000-1-aef71.parquet
    |  |- order_ts_hour=2021-01-27-10/
    |  |  |- 00000-3-0fa3a.parquet

[Diagram: the catalog still points to the current metadata file; the query follows the manifest list of the snapshot that was current at the requested time.]
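
Choosing the snapshot for an `AS OF` query can be sketched as picking the latest snapshot committed at or before the requested time. This assumes each snapshot entry in the metadata file records its commit time in a `timestamp-ms` field, as the Iceberg specification does. The snapshot ids, the commit times, and the helper names below are made up for illustration.

```python
from datetime import datetime, timezone


def to_ms(text):
    """Convert a 'YYYY-MM-DD HH:MM:SS' UTC string to epoch milliseconds."""
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(ts.timestamp() * 1000)


def snapshot_as_of(snapshots, as_of_ms):
    """Latest snapshot committed at or before as_of_ms, or None if there is none."""
    candidates = [s for s in snapshots if s["timestamp-ms"] <= as_of_ms]
    return max(candidates, key=lambda s: s["timestamp-ms"], default=None)


# Hypothetical commit times: the INSERT at 09:00, the MERGE INTO at 10:00.
snapshots = [
    {"snapshot-id": 1, "timestamp-ms": to_ms("2021-05-26 09:00:00"),
     "manifest-list": "table1/metadata/snap-29c8-1-b103.avro"},
    {"snapshot-id": 2, "timestamp-ms": to_ms("2021-05-26 10:00:00"),
     "manifest-list": "table1/metadata/snap-9fa1-3-16c3.avro"},
]

chosen = snapshot_as_of(snapshots, to_ms("2021-05-26 09:30:00"))
print(chosen["manifest-list"])
```

From the chosen snapshot onward, the read proceeds exactly as for the current snapshot, just starting from an older manifest list.
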




Case Study: Atlas 


• Historical Atlas data:
  o Time-series metrics from Netflix runtime systems
  o 1 month: 2.7 million files in 2,688 partitions
  o Problem: cannot process more than a few days of data

• Sample query:

    select distinct tags['type'] as type
    from iceberg.atlas
    where
      name = 'metric-name' and
      date > 20180222 and date <= 20180228
    order by type;


Case Study: Atlas Performance 


• Hive table, with Parquet filters:
  o 400k+ splits per day, not combined
  o EXPLAIN query: 9.6 min (planning wall time)

• Iceberg table, partition data filtering:
  o 15,218 splits, combined
  o 13 min (wall time) / 61.5 hr (task time) / 10 sec (planning)

• Iceberg table, partition and min/max filtering:
  o 412 splits
  o 42 sec (wall time) / 22 min (task time) / 25 sec (planning)


Enabled by this design: Compaction

• Asynchronously compact small files into fewer, larger files
• Being asynchronous helps balance the write-side and read-side trade-offs
• Input and output of compaction jobs can be different file types
  o E.g. Avro from streaming writes, compacted into larger Parquet files for analytics
• Scheduling/triggering and the actual compaction work are done by external tools
  o Scheduling/triggering: scheduler, workflow tool, etc.
  o Compaction work execution: processing engine (e.g. Spark, Dremio)
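
The planning half of a compaction job can be sketched as grouping small files until a target size is reached. This is a simple greedy grouping chosen for illustration, not the strategy of any particular engine; the function name, file names, and sizes are made up. Rewriting the grouped files and committing the result would be done by a processing engine.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group small files so that each group totals at most target_size."""
    groups, current, current_size = [], [], 0
    for path, size in sorted(file_sizes.items()):
        if size >= target_size:
            continue  # already large enough; leave it alone
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    # A group with a single file gains nothing from being rewritten.
    return [group for group in groups if len(group) > 1]


# Made-up sizes in MB: three small streaming files and one large file.
file_sizes = {"a.avro": 40, "b.avro": 50, "c.avro": 30, "d.parquet": 200}
plan = plan_compaction(file_sizes, target_size=100)
print(plan)
```
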
Design Benefits

• Efficiently make smaller updates
  o Make changes at the file level
• Snapshot isolation for transactions
  o Reads and writes don't interfere with each other, and all writes are atomic
  o Concurrent writes
• Faster planning and execution
  o List of files defined on the write-side
  o Column stats in manifest files used to eliminate files
• Reliable metrics for CBOs (vs Hive)
  o Done on write instead of "infrequent" expensive read job
• Abstract the physical, expose a logical view
  o Hidden partitioning
  o Compaction
  o Tables can change over time
  o Can transparently experiment with table layout
• Rich schema evolution support
• All engines see changes immediately




Additional Resources

• iceberg.apache.org
• iceberg.apache.org/blogs/
• dremio.com/subsurface/apache-iceberg/

• Get hands on: iceberg.apache.org/getting-started
Thank You